Appendix B. Common Misapplications of Statistics

This appendix describes mistakes that frequently appear in statistical reports and, for each, summarizes an appropriate alternative to the unacceptable practice. This information can help regulators, consultants, and stakeholders better understand how to apply statistics to groundwater data sets. The problems and errors below can occur during planning, implementation, or both (as noted for each).

B.1 Statistical Error and Resolution

Problem/Error (Planning/Implementation): Concluding that, if the sample maximum is less than a decision criterion, this is likely conservative from a risk perspective.

It is not necessarily true that the study area contamination is less than the decision criterion because the sample maximum is less than the decision criterion. (Decision criterion is the general term used in this document to identify a groundwater concentration that is relevant to a project; it is used instead of designations such as Groundwater Protection Standard, clean-up standard, or clean-up level.) The study area population mean (the arithmetic average of a sample set that estimates the middle of a statistical distribution; Unified Guidance) may be greater than the decision criterion when the sample maximum is less than the decision criterion, depending on the nature of the distribution and sample size.
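To show how easily this can happen, the following minimal sketch computes the chance that every one of n results falls below the criterion even though the population mean exceeds it. It assumes independent, normally distributed measurements; the function name and the example numbers are hypothetical and are not taken from any guidance.

```python
from statistics import NormalDist

def prob_all_below(criterion, mean, sd, n):
    """Probability that all n independent normal results fall below the criterion."""
    p_single = NormalDist(mean, sd).cdf(criterion)
    return p_single ** n

# Hypothetical case: the true mean (11) exceeds the criterion (10); sd = 4.
for n in (1, 3, 5, 10):
    print(n, round(prob_all_below(10.0, 11.0, 4.0, n), 3))
```

With three samples in this hypothetical case, roughly 6% of sampling events would show a maximum below the criterion even though the mean exceeds it. The probability shrinks as n grows, which is why sample size matters.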

Recommendation: Use a one-sample hypothesis test.

Problem/Error (Implementation): Concluding there is necessarily a problem because one grab sample result exceeds the decision criterion.

It is sometimes concluded that study area contaminant concentrations are elevated (and therefore that additional remedial activities are needed) because one or more grab samples exceed the criterion. This conclusion is not necessarily valid. The study area mean may be less than the criterion even when individual grab concentrations exceed the criterion (similar to the first misapplication above).

Recommendation: As part of a systematic planning process, determine whether numerical goals will be treated as “ceilings,” simple averages, or some other parameter. Use a one-sample hypothesis test to make inferences about the study area mean.

Problem/Error (Planning/Implementation): Comparing the site sample maximum with a background threshold (for example, the background maximum or mean) without regard to potential decision errors.

In this case, the site sample maximum is compared with the background sample maximum to determine whether the site contamination is elevated relative to background. (Background is the natural or baseline groundwater quality at a site that can be characterized by upgradient, historical, or sometimes cross-gradient water quality; Unified Guidance.) In general, do not compare maximums from the two data sets to make inferences about the means of the data sets, as this does not control decision errors and can result in false positives.
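The size of the false positive rate can be sketched with a simple counting argument. If the site and background results are independent draws from the same continuous population, the largest of all the results is equally likely to be any one of them, so the chance that it is a site result is the site's share of the total. The function name below is hypothetical.

```python
def prob_site_max_exceeds(n_site, n_background):
    """P(site maximum > background maximum) when both samples come from
    the same continuous population (independent results, no ties)."""
    return n_site / (n_site + n_background)

# Hypothetical sample sizes: (site, background)
for n_site, n_bkg in ((10, 10), (30, 10), (100, 10)):
    print(n_site, n_bkg, prob_site_max_exceeds(n_site, n_bkg))
```

With equal sample sizes, an uncontaminated site would appear "elevated" half the time under this comparison, and the rate rises as more site samples are collected.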

Recommendation: Use two-sample hypothesis tests or compare the study area results with background upper prediction limits (UPLs).

Problem/Error (Implementation): After a large number of site samples have been collected, comparing the site sample maximum to a background 95% upper tolerance limit (UTL) and concluding that exceeding the UTL necessarily means the site concentration is elevated relative to the background concentration.

This approach does not control false positives. The probability of a false positive approaches 100% as the number of study area results increases, as when, for example, many samples are analyzed for multiple analytes.
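The growth in the false positive rate can be approximated with a short calculation. This sketch makes a simplifying assumption: it treats the UTL as if it were exactly the 95th percentile of the background population and treats the site results as independent draws from that same population. An actual UTL sits somewhat above that percentile, so real rates are lower, but the pattern is the same. The function name is hypothetical.

```python
def prob_any_exceedance(n_results, coverage=0.95):
    """Chance that at least one of n independent background-like results
    exceeds a threshold that covers `coverage` of the population."""
    return 1.0 - coverage ** n_results

for n in (1, 10, 30, 59, 100):
    print(n, round(prob_any_exceedance(n), 3))
```

Under these assumptions the chance of at least one exceedance passes 95% by the time 59 results are compared, even though the site is no different from background.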

Recommendation: Use a two-sample hypothesis test.

Problem/Error (Planning/Implementation): Using UTLs rather than upper prediction limits (UPLs) when false positives primarily need to be controlled.

In this example, only a small number of study area results were collected but a large number of background results were available, so hypothesis tests could not be performed to compare study area concentrations to background concentrations; instead, the study area results were compared with UTLs. When background and site concentrations are not different from one another, all the study area results will be less than the background UPL with some specified level of confidence. For example, when the background threshold is a 95% UPL, there is a 95% chance that 100% of the study area results will be less than the background threshold. However, only a proportion of the study area results will be less than the background UTL at the specified level of confidence. For example, when the background threshold is a “95% UTL with 95% coverage” (typically referred to as a “95% UTL”), there is a 95% chance that 95% (not 100%) of the study area results will be less than the background UTL.

Recommendation: To best control false positives, use background UPLs rather than UTLs. Compare the k study area results (k is a variable representing the number of study area results) to the UPL for the next k future observations. The UPL depends on the number of study area results k that will be compared and increases as k increases.
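One common parametric form of this limit, sketched here under the assumption that the background data are approximately normal, applies a Bonferroni adjustment for the k comparisons. The symbols are notation introduced for this illustration only: n background results with sample mean x̄ and sample standard deviation s, and an overall false positive rate α.

```latex
\mathrm{UPL}_k = \bar{x} + t_{1-\alpha/k,\; n-1}\, s \sqrt{1 + \frac{1}{n}}
```

Because the t quantile grows as α/k shrinks, the limit rises as k increases, which is the behavior described in the recommendation. Other constructions (for example, nonparametric limits) exist and may be more appropriate for non-normal data.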
Error (Planning): Assuming that reliable inferences (decisions) can be made based on very small sample sizes (for example, n < 5).

There are many variations of this problem. Examples include attempting to calculate an exposure point concentration as a 95% upper confidence limit (UCL) when the sample size is small (for example, n = 6) or, in the extreme case, comparing one measurement to the decision criterion (for example, the reporting requirement for polychlorinated biphenyls (PCBs) under the Toxic Substances Control Act (TSCA)). A UCL is the upper value of a range of values around a statistic (for example, the mean) within which the population statistic is expected to be located with a given level of certainty, such as 95% (science-dictionary.org 2013). Reliable inferences about the study area mean cannot generally be made when sample sizes are small. The nature of the underlying measurement distribution cannot be reliably determined, which is needed, for example, to calculate a 95% UCL for exposure point concentrations for the risk assessment. The 95% UCL depends on the number of samples taken; it can be highly unstable in small data sets or in data sets with larger variation. The solution is to avoid undersampling by using a systematic planning process.

Recommendation: Follow a systematic planning process, such as the seven-step DQO process, and establish tolerances for Type I (false positive) and Type II (false negative) decision errors. In hypothesis testing, a false positive (Type I) error occurs when the null hypothesis (H₀) is true but is rejected in favor of the alternative hypothesis (Hᴀ); a false negative (Type II, β) error occurs when Hᴀ is true but is rejected in favor of H₀ (Unified Guidance). Understand the requirements and limitations of the statistical methods that will be applied to the sample results (see Section 3 and Section 5) before deciding how many samples to collect. As a "rule of thumb," 8 to 10 measurements are often needed to do statistical evaluations; however, a larger sample size will likely be needed if the data set is very skewed or contains censored values (nondetects). The rule of thumb should not substitute for planning, determining what sample size is appropriate for the particular application, and documentation.
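As a rough illustration of how decision error tolerances drive sample size, the sketch below uses the textbook normal-approximation formula for a one-sided, one-sample test of a mean. It is not the DQO procedure itself. The function name, the default error rates, and the idea of a "difference worth detecting" (delta) are assumptions made for this example, and the result is only a planning-level starting point for roughly normal data with a known standard deviation.

```python
import math
from statistics import NormalDist

def approx_sample_size(sd, delta, alpha=0.05, beta=0.20):
    """Approximate n for a one-sided, one-sample z-test.

    sd    : assumed standard deviation of individual results
    delta : smallest difference from the criterion that must be detected
    alpha : tolerated false positive (Type I) rate
    beta  : tolerated false negative (Type II) rate
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1.0 - alpha)
    z_beta = z.inv_cdf(1.0 - beta)
    return math.ceil(((z_alpha + z_beta) * sd / delta) ** 2)

print(approx_sample_size(sd=1.0, delta=1.0))  # variability equal to delta
print(approx_sample_size(sd=2.0, delta=1.0))  # variability twice delta
```

Doubling the variability relative to the difference of interest roughly quadruples the required number of samples, which is one reason a fixed rule of thumb cannot replace planning.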
Problem/Error (Implementation): Comparing sequential measurements in time with one another on a sample-by-sample basis to determine if there are increasing or decreasing trends.

Sample-by-sample comparison may fail to distinguish random variability from long-term changes in concentration. Statistical tests are needed to make that distinction.

Recommendation: Present time series plots with the results of statistical trend analyses (for example, p-values from Mann-Kendall tests).
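For readers unfamiliar with the Mann-Kendall test, the following is a minimal sketch of the statistic and a two-sided p-value based on the usual normal approximation with a tie correction. It assumes equally weighted, independent results in time order and is not a substitute for a validated statistics package; the normal approximation is generally considered reasonable only for about ten or more results.

```python
import math
from collections import Counter
from statistics import NormalDist

def mann_kendall(values):
    """Return (S, two-sided p-value) for results listed in time order."""
    n = len(values)
    s = 0
    for i in range(n - 1):
        for j in range(i + 1, n):
            diff = values[j] - values[i]
            s += (diff > 0) - (diff < 0)  # +1, 0, or -1 for each pair
    # Variance of S, reduced for groups of tied values.
    var_s = n * (n - 1) * (2 * n + 5)
    for count in Counter(values).values():
        var_s -= count * (count - 1) * (2 * count + 5)
    var_s /= 18.0
    # Continuity-corrected z score.
    if s > 0:
        z = (s - 1) / math.sqrt(var_s)
    elif s < 0:
        z = (s + 1) / math.sqrt(var_s)
    else:
        z = 0.0
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return s, p_value

# Hypothetical quarterly concentrations
print(mann_kendall([5.1, 4.8, 5.6, 5.3, 6.0, 5.9, 6.4, 6.2, 7.0, 6.8]))
```

A positive S with a small p-value suggests an increasing trend; a p-value well above the chosen significance level means the apparent ups and downs are consistent with random variability.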
Problem/Error (Implementation): Substituting arbitrary multiples of the reporting limit for nondetects for statistical evaluations rather than treating the nondetects as inequalities.

This error distorts the data sets and can produce erroneous conclusions.

Recommendation: Do not substitute arbitrary multiples of the reporting limits (for example, ½ the detection limit) for nondetects in statistical evaluations without considering the impact on the particular statistical evaluation to be performed (see Section 5.7.5). A nondetect is a laboratory analytical result known only to be below the method detection limit (MDL) or reporting limit (RL); see "censored data" (Unified Guidance). Treat nondetects as inequalities and use nonparametric statistical methods that do not rely on substituting surrogate values for nondetects (for example, Kaplan-Meier methods). Nonparametric methods do not depend on knowledge of the distribution of the sampled population (Unified Guidance). Section 5.7 includes discussion of managing nondetect data.
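One simple way to gauge the impact of substitution is to treat each nondetect literally as an inequality (somewhere between zero and its reporting limit) and see how far the sample mean can move. The sketch below is a bounding check only, not a Kaplan-Meier estimate; the data layout, function names, and example values are hypothetical.

```python
def mean_bounds(results):
    """results: list of (value, detected) pairs. For a nondetect, value is
    the reporting limit, and the true result lies between 0 and that limit."""
    n = len(results)
    lower = sum(v if detected else 0.0 for v, detected in results) / n
    upper = sum(v for v, _ in results) / n
    return lower, upper

def substituted_mean(results, fraction):
    """Mean after replacing each nondetect with fraction * reporting limit."""
    return sum(v if d else fraction * v for v, d in results) / len(results)

# Hypothetical data set: three of five results are nondetects.
results = [(5.0, True), (1.0, False), (1.0, False), (8.0, True), (2.0, False)]
print(mean_bounds(results))            # range of means consistent with the data
print(substituted_mean(results, 0.5))  # one arbitrary choice within that range
```

If the decision would change anywhere within the bounds, the choice of substitution value is driving the conclusion and a method designed for censored data is warranted.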
Problem/Error (Implementation): Using an incorrect censoring limit; for example, reporting nondetects to the MDL in 40 CFR Part 136 Appendix B (which was designed to minimize Type I, false positive, error).

The reporting limits for nondetects need to minimize Type II (false negative) error, not Type I error (given the null hypothesis μ ≤ 0). Reporting nondetects to the MDL can underestimate contamination (resulting in false negatives) when the nondetects are compared to risk-based criteria.

Recommendation: Report nondetects to the laboratory's quantitation limit or to a smaller reporting limit that was otherwise demonstrated to minimize false negatives (for example, the Limits of Detection (LOD) defined in the DOD Quality System Manual (DOD 2013)).

Error (Planning/Implementation): Using arbitrary decision rules; for example, concluding that groundwater is clean when three consecutive rounds of results are less than a criterion.

Arbitrary decision rules do not control decision errors.

Recommendation: Use a systematic planning process, such as the seven-step DQO process, to develop a statistical approach that controls decision errors to an acceptable degree. Then use one-sample hypothesis tests.

Problem/Error (Implementation): Failing to use "like statistics" in comparisons.

There are many variations of this problem. For example:
- Comparing the mean of Site A to some percentile of Site B
- Comparing a percentile of Site A to the same percentile of Site B and concluding on this basis that the mean of Site A is larger than the mean of Site B
- Comparing composite sample results from a study area (which estimate the mean) to a background upper prediction limit (UPL) or upper tolerance limit (UTL) determined from grab samples
- Comparing the 95% upper confidence limit (UCL) of the mean from a site with few sample results to the 95% UCL of the mean from a site with many sample results

Even when the study area and background area possess the same underlying population, the percentiles of the grab and composite distributions will differ. Replicate grabs will produce a distribution of individual measurements x_i, but replicate composites, each prepared from n grabs, will produce the distribution of sample means of sample size n; that is, the distribution of the statistic m_i = (x_i1 + x_i2 + … + x_in)/n. Therefore, it is inappropriate to compare composite percentiles or individual composite results with grab percentiles.

For example, assuming both population distributions are normal, comparing a study area composite to the 95th percentile of the distribution of background grabs is similar to comparing the 50th percentile (median/mean) of the study area with the 95th background percentile. (The median is the 50th percentile of an ordered set of samples; Unified Guidance.) Comparing unlike statistical parameters leads to unpredictable mistakes in decisions.
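The mismatch between grab and composite percentiles can be quantified with a small sketch. It assumes a normal population and ideal composites (each composite result equals the average of its n grabs, with no added mixing or analytical error); the function name and the numbers are hypothetical.

```python
import math
from statistics import NormalDist

def percentile_95(mean, sd, n_per_composite=1):
    """95th percentile of results when each result is an ideal composite
    of n_per_composite grabs from a normal(mean, sd) population."""
    sd_of_result = sd / math.sqrt(n_per_composite)
    return NormalDist(mean, sd_of_result).inv_cdf(0.95)

# Same hypothetical population (mean 10, sd 4) sampled two ways
print(round(percentile_95(10.0, 4.0), 2))                     # grabs
print(round(percentile_95(10.0, 4.0, n_per_composite=4), 2))  # 4-grab composites
```

The two percentiles differ even though the underlying population is identical, so a threshold built from grab samples is not a fair yardstick for composite results.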

Recommendation: Understand the nature of the statistical parameters being compared. For example, as the number of samples increases, the 95% UCL converges on the mean of the population, but the 95% UTL converges to the 95th percentile. The 95% UCL and the 95% UTL are therefore not comparable. In environmental applications, risk is often calculated using the mean concentrations of the site, but regulatory requirements may be based on a parameter in the upper tail of the distribution. A systematic planning process will identify the appropriate parameter.

Problem/Error (Implementation): Failing to check the assumptions required for statistical tests.
- Performing regression fits without testing the underlying assumptions required for these fits to be appropriate (for example, linearity, normality of the residuals, and constant standard deviation for the residuals).
- Assuming a distribution without testing the result to check whether it is reasonable. For example, assuming measurements are normal or lognormal without testing for normality/lognormality. (A lognormal data set is not normally distributed, that is, it does not follow a symmetric bell-shaped curve, but it can be transformed using a natural logarithm so that it can be evaluated using a normal-theory test; Unified Guidance.)
- Assuming all temporal trends can necessarily be modeled by a simple equation of the form y = at + b (thus ignoring periodic trends) and not investigating whether other models are more appropriate.

Recommendation: Use exploratory data analysis (EDA) techniques. Routinely graph data (box plots, scatter plots, and histograms) to qualitatively evaluate the distributions of the results. Use goodness-of-fit tests before doing other statistical tests. For time series data, investigate whether non-linear fits are more appropriate; for example, consider equations of the following form: y = a t + b + c sin(2πt) + d cos(2πt), where y denotes concentration, t the time, and a, b, c, and d are constants.
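A model with a linear trend plus sine and cosine terms can be fit by ordinary least squares because it is linear in its constants. The sketch below is one minimal way to do so using only the standard library; it assumes the model has an intercept term b, that time is expressed in years (so the periodic terms describe an annual cycle), and that the sampling times are varied enough for the four terms to be distinguishable. The function name is hypothetical.

```python
import math

def fit_seasonal_trend(times, values):
    """Least-squares fit of y = a*t + b + c*sin(2*pi*t) + d*cos(2*pi*t).

    Returns [a, b, c, d]. Times are assumed to be in years.
    """
    rows = [[t, 1.0, math.sin(2 * math.pi * t), math.cos(2 * math.pi * t)]
            for t in times]
    k = 4
    # Normal equations (X^T X) p = X^T y, stored as an augmented matrix.
    aug = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
           + [sum(r[i] * y for r, y in zip(rows, values))]
           for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(k):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * p for x, p in zip(aug[r], aug[col])]
    return [aug[i][k] / aug[i][i] for i in range(k)]

# Hypothetical monthly data over three years with a known trend and cycle
times = [i / 12 for i in range(36)]
values = [0.5 * t + 2.0 + 1.5 * math.sin(2 * math.pi * t)
          - 0.7 * math.cos(2 * math.pi * t) for t in times]
print([round(p, 3) for p in fit_seasonal_trend(times, values)])
```

After fitting, the residuals still need to be checked (for example, with the plots mentioned above) before the fitted trend coefficient is interpreted.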

Avoid software applications that do not emphasize using the correct test for the distribution.

Problem/Error (Implementation): Using correlation as the sole criterion to evaluate the comparability of two different sampling or analytical methodologies. Believing that a correlation coefficient near one for two different methods means the two methods give comparable data.

A correlation coefficient near one indicates the results are correlated, but does not mean that they are comparable. (Correlation is an estimate of the degree to which two sets of variables vary together, with no distinction between dependent and independent variables; USEPA 2013b.)
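A small numerical sketch makes the distinction concrete. In the hypothetical data below, one method reads exactly twice as high as the other: the correlation is perfect, yet the paired results clearly disagree. The function names and values are illustrative only.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def mean_paired_difference(x, y):
    """Average of (y - x) over the paired results."""
    return sum(b - a for a, b in zip(x, y)) / len(x)

lab = [1.0, 2.0, 4.0, 8.0, 16.0]
field = [2.0 * v for v in lab]  # hypothetical method with a 2x bias

print(pearson_r(lab, field))               # correlation is essentially 1
print(mean_paired_difference(lab, field))  # yet the methods disagree
```

A paired comparison looks at the differences between methods directly, which is why it can reveal a bias that the correlation coefficient hides.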

Recommendation: Use appropriate statistical tests for paired data; for example, refer to the U.S. Army Corps of Engineers guidance (USACE 1998).

Error (Implementation): Failing to account for the uncertainty when determining a functional relationship between two methods/variables such as Y = F(X), so that when a measurement X is taken (for example, using a low cost method), Y can be predicted.

This problem becomes more common as additional, less expensive analytical techniques are introduced. Here is an example: The study area was sampled with an inexpensive field method (giving results denoted by X). A limited number of split samples were also analyzed with a fixed-laboratory method (giving results denoted by Y). The field method was highly positively correlated with the lab method, giving a relationship Y = F(X) (for example, Y = a X + b). The field method was then used to determine if portions of the study area are "clean" or "dirty." In effect, it was assumed that the field method produced reliable, definitive data for decision making using the relationship Y = F(X). However, when X was measured to obtain Y, the uncertainty of the calculated value of Y was not reported or taken into account.

Recommendation: In the example above, a prediction interval for the fit Y = F(X) should have been calculated, as the field method was being used to determine extent of contamination by comparing individual measurements to the cleanup criterion.
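For the straight-line case Y = a X + b, the standard prediction interval for a single new value of Y at a field result x₀ can be written as follows. This is the textbook form and assumes the relationship is linear with independent, normally distributed residuals of constant variance. The symbols are notation introduced for this illustration: n split-sample pairs, x̄ the mean of the field results, ŷ₀ = a x₀ + b the predicted laboratory result, and α the tolerated two-sided error rate.

```latex
\hat{y}_0 \;\pm\; t_{1-\alpha/2,\; n-2}\; s \sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}},
\qquad
s = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n-2}},
\qquad
S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2
```

The interval widens when few split samples are available and when x₀ lies far from the range of the calibration data. Comparing only the point prediction ŷ₀ to the cleanup criterion ignores this uncertainty.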
Problem/Error (Planning): Setting the null hypothesis as H₀: μ(site) > μ(background) when comparing study area concentrations to background concentrations.

This error may be committed out of consideration of the “precautionary principle.” However, it requires the data user to show that site concentrations are less than background concentrations. As a practical matter, this would be nearly impossible to demonstrate even if the site were not contaminated.

Recommendation: Set the null hypothesis as "H₀: μ(site) ≤ μ(background)." (The null hypothesis is one of two mutually exclusive statements about the population from which a sample is taken, and is the initial and favored statement, H₀, in hypothesis testing; Unified Guidance.) Note that in ProUCL, when using a two-sample hypothesis test for site and background data, the default null hypothesis H₀ is consistent with this recommendation.

Problem/Error (Implementation): Discarding outliers solely on the basis of a statistical outlier test without presenting any physical justification, especially for a background data set (thereby biasing the results).

Discarding outliers (values unusually discrepant from the rest of a series of observations; Unified Guidance) will bias the results of statistical tests that use the data set from which the outliers were removed. (Bias is a systematic deviation between a measured (observed) or computed value and its true value; it is affected by faulty instrument calibration and other measurement errors, systematic errors during data collection, and sampling errors such as incomplete spatial randomization during the design of sampling programs; Unified Guidance.) However, a data point that is a statistical outlier relative to the rest of the data set is not necessarily incorrect or unrepresentative. This principle applies whether the data set is from a potentially contaminated site or from a reference background site.

Recommendation: Retain the outliers unless there is a strong weight of physical evidence to remove them. This decision is not a purely statistical consideration and should be made in the context of a systematic planning process. Document why these samples are likely not physically representative of anthropogenic or natural background conditions or why the quality of laboratory analytical results is substandard.

Problem/Error (Implementation): Failing to distinguish between a statistically significant result and a result of no practical significance.

A test of normality will detect very small deviations from normality when the sample size (that is, the number of samples) n is large, but for the end data use the small deviation from normality may not be of any practical importance. Be sure to determine both whether a difference exists and the magnitude of the difference.
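The same effect of sample size applies to any hypothesis test. As the simplest illustration (a one-sided z-test on a mean rather than a normality test), the sketch below shows how a fixed, practically negligible difference becomes "statistically significant" once n is large enough. The function name and the numbers are hypothetical, and the standard deviation is assumed known.

```python
import math
from statistics import NormalDist

def one_sided_p_value(observed_diff, sd, n):
    """p-value of a one-sided z-test for a mean exceeding its reference
    value by observed_diff, given n results with standard deviation sd."""
    z = observed_diff / (sd / math.sqrt(n))
    return 1.0 - NormalDist().cdf(z)

# The same small difference (5% of one standard deviation) at two sample sizes
print(one_sided_p_value(0.05, 1.0, 20))
print(one_sided_p_value(0.05, 1.0, 10000))
```

The difference is identical in both cases; only the sample size changed. This is why the magnitude of a difference should be reported alongside the p-value.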

Example: As reported, a test of normality detected a very small deviation from normality because the sample size n was very large. However, the normal probability plot and histogram indicated that the distribution is essentially normal. Also, the one-sample hypothesis test is being done to determine if the mean is greater than a decision criterion. Given the Central Limit Theorem, the large value of n, and the near-normal distribution of the individual measurements, normality can be assumed for the purpose of using a one-sample t-test. The Central Limit Theorem states that, given a distribution with mean μ and variance σ², the sampling distribution of the mean approaches a normal distribution with mean μ and variance σ²/N as the sample size N increases (USEPA 2010). A normal distribution is a symmetric distribution of data (bell-shaped curve) and is the most common distribution assumption in statistical analysis (Unified Guidance).

Recommendation: Use graphical methods to determine the reasonableness of tests for normality, lognormality, or other distributions. When statistically significant differences are detected, assess the magnitude of these differences in the context of the project's objectives. Document decisions related to the appropriateness of statistical tests.

Problem/Error (Implementation): Concluding that the failure to reject the null hypothesis "proves" the null hypothesis (when the power of the test is not also addressed).

Typically, large random variability exists in environmental data, and a trend or difference in concentration could be relatively small. Failure to reject the null hypothesis does not prove the null hypothesis if the statistical test has insufficient power.
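Power can be estimated before relying on a "no difference" result. The sketch below uses the normal approximation for a one-sided, one-sample test of a mean; it assumes a known standard deviation and independent results, and the function name and numbers are hypothetical.

```python
import math
from statistics import NormalDist

def z_test_power(true_diff, sd, n, alpha=0.05):
    """Probability that a one-sided z-test at level alpha rejects H0 when
    the true mean exceeds the reference value by true_diff."""
    z = NormalDist()
    return z.cdf(true_diff * math.sqrt(n) / sd - z.inv_cdf(1.0 - alpha))

# Hypothetical case: a real difference of 0.5 with sd = 2
print(round(z_test_power(0.5, 2.0, 8), 2))    # small data set
print(round(z_test_power(0.5, 2.0, 100), 2))  # large data set
```

With only eight results, the test in this hypothetical case would miss the real difference most of the time, so failing to reject the null hypothesis says little about whether the difference exists.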

Recommendation: Use care when stating a conclusion based on failure to reject the null hypothesis (for example, “The null hypothesis could not be confidently rejected.”). See Rong (2011).

B.2 References

DOD (Department of Defense). 2013. Quality Systems Manual (QSM) for Environmental Laboratories Based on ISO/IEC 17025:2005(E) and The NELAC Institute (TNI) Standards. Volume 1, (September 2009) DOD Quality System Manual Version 5.0.

Rong, Y. 2011. "Statistical Methods and Pitfalls in Environmental Data Analysis." Practical Environmental Statistics and Data Analysis 10:243-258.

USACE (United States Army Corps of Engineers). 1998. "Environmental Quality, Technical Project Planning (TPP) Process." EM 200-1-2. Washington, D.C.: Department of the Army.

Publication Date: December 2013

Permission is granted to refer to or quote from this publication with the customary acknowledgment of the source (see suggested citation and disclaimer).